# Image-Text Understanding

| Model | Author | License | Tags | Downloads | Likes | Description |
|-------|--------|---------|------|----------:|------:|-------------|
| Qwen2 VL 7B Instruct GGUF | XelotX | Apache-2.0 | Image-to-Text, English | 201 | 1 | A quantized version of the multimodal model Qwen2-VL-7B-Instruct, supporting image-text-to-text tasks at various quantization levels. |
| Razorback 12B V0.2 | nintwentydo | Other | Image-to-Text, Transformers, Multilingual | 17 | 3 | A multimodal model combining the strengths of Pixtral 12B and UnslopNemo v3, featuring visual understanding and language processing capabilities. |
| Lava Phi | sagar007 | MIT | Image-to-Text, Transformers, Multilingual | 17 | 0 | A vision-language model based on Microsoft's Phi-1.5 architecture, combined with CLIP for image processing. |
| Llava 1.6 Mistral 7b Gguf | cjpais | Apache-2.0 | Image-to-Text | 9,652 | 106 | A GGUF-quantized build of LLaVA, an open-source multimodal chatbot trained by fine-tuning an LLM on multimodal instruction-following data; multiple quantization options are offered. |
| Llava Phi2 | RaviNaik | MIT | Image-to-Text, Transformers, English | 153 | 6 | A multimodal implementation based on Phi-2, combining vision and language processing capabilities for image-text-to-text tasks. |
| Mmalaya | DataCanvas | Apache-2.0 | Image-to-Text, Transformers | 31 | 1 | A multimodal system built on the Alaya large language model, comprising three core components: a large language model, an image-text feature encoder, and a feature transformation module. |
| Llava V1.5 13B AWQ | TheBloke | | Image-to-Text, Transformers | 141 | 35 | An AWQ-quantized build of LLaVA, an open-source multimodal chatbot fine-tuned on GPT-generated multimodal instruction-following data on top of LLaMA/Vicuna. |
| Llava Pretrain Vicuna 7b V1.3 | liuhaotian | | Image-to-Text, Transformers | 54 | 1 | A LLaVA pretraining checkpoint; LLaVA is an open-source multimodal chatbot fine-tuned on GPT-generated multimodal instruction-following data on top of LLaMA/Vicuna. |
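
Several entries above are GGUF quantized builds intended for local inference with llama.cpp. As a minimal sketch of how such a build can be run, the `llama-cpp-python` bindings load the language model GGUF together with a separate CLIP/mmproj GGUF via a LLaVA-style chat handler. The file paths and image URL below are placeholders, and it is an assumption that the specific repos listed here ship both files; quantization level (e.g. Q4_K_M) only changes which model file you point at.

```python
# Minimal sketch: run a GGUF LLaVA-style multimodal model locally.
# Paths and the image URL are placeholders, not files from a specific
# repo above; assumes the repo provides both a language-model GGUF and
# a CLIP/mmproj GGUF.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-model-f16.gguf")
llm = Llama(
    model_path="./llava-v1.6-mistral-7b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # larger context leaves room for the image embedding tokens
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant that describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        },
    ]
)
print(response["choices"][0]["message"]["content"])
```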
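
For the entries tagged Image-to-Text and Transformers, the generic Hugging Face `pipeline` interface is the usual starting point. Whether each specific checkpoint in this list is pipeline-compatible is an assumption (some require custom loading code), so the sketch below uses a well-known captioning model purely as a stand-in rather than one of the models listed above.

```python
# Minimal sketch of the generic image-to-text pipeline; the model id is a
# stand-in example, not one of the checkpoints listed above.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("https://example.com/cat.png")  # accepts a URL, path, or PIL image
print(result[0]["generated_text"])
```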